Univariate Plots Section

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Meaning of each variable

  • Fixed Acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
  • Volatile Acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
  • Citric Acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
  • Residual Sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
  • Chlorides: the amount of salt in the wine
  • Free Sulfur Dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
  • Total Sulfur Dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
  • Density: the density of wine is close to that of water depending on the percent alcohol and sugar content
  • pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
  • Sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
  • Alcohol: the percent alcohol content of the wine
  • Quality: score between 0 and 10

The red wine dataset consists of 12 variables and 1,599 observations (records).

Wine quality is scored on a scale from 0 to 10, but only scores from 3 to 8 appear in this dataset. Scores of 5 and 6 are the most common. Moving away from them in either direction, the number of wines drops sharply, and very few wines are scored 3 or 8. I wonder what causes the count to fall between adjacent scores such as 3 and 4, 4 and 5, 6 and 7, or 7 and 8. First, though, I want to see what the distributions of the other variables look like.

The two graphs have a similar shape, with a mountain in the middle that marks the range where most wines fall. On either side of the most common values, the count decreases rapidly. This pattern resembles the distribution of quality. I wonder how they are related.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

For citric acid, the minimum value is 0 and the maximum is 1. With 20 bins, the counts appear to fluctuate. After increasing the number of bins, it becomes clear that particularly many wines have a citric acid value of 0 or 0.5. I wonder what quality the wines at those citric acid values have.
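The effect of the bin count can be sketched in base R. This is only an illustration: it uses the first ten citric acid values from the str() output above as stand-in data, and the variable names (citric_acid, coarse, fine) are my own, not the ones used to draw the plots in this report.

```r
# Illustrative subset only: the first ten citric.acid values shown in the
# str() output, not the full 1,599-row data set.
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Bin the same values into 20 and then 100 equal-width bins over [0, 1].
coarse <- hist(citric_acid, breaks = seq(0, 1, length.out = 21), plot = FALSE)
fine   <- hist(citric_acid, breaks = seq(0, 1, length.out = 101), plot = FALSE)

# Collect bin start and count, then show only the non-empty bins.
coarse_bins <- data.frame(bin_start = head(coarse$breaks, -1),
                          count = coarse$counts)
fine_bins   <- data.frame(bin_start = head(fine$breaks, -1),
                          count = fine$counts)

print(coarse_bins[coarse_bins$count > 0, ])
print(fine_bins[fine_bins$count > 0, ])
```

With the wide bins, the wines at exactly 0 are merged with their near neighbours; the narrow bins keep them apart, which is how a spike at a single value becomes visible.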

The shape of the graph looks similar to those of fixed acidity and volatile acidity. I wonder how these three variables are related.

summary(rwd$chlorides)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Most wines have a chloride level between 0.07 and 0.09 (median 0.079, mean 0.08747). Even on a log scale, much of the data is crowded into a very small range of chloride values. I wonder whether the most common chloride level is associated with better or worse quality.
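The crowding can be checked with a little arithmetic on the summary values alone. The sketch below, in base R, compares how much of the full range the middle 50% of wines occupies on the raw scale and on a log10 scale. The helper iqr_share is a name I made up for this illustration.

```r
# Quartiles and extremes copied from summary(rwd$chlorides) above.
chl <- c(min = 0.012, q1 = 0.070, median = 0.079, q3 = 0.090, max = 0.611)

# Share of the full range covered by the interquartile range.
iqr_share <- function(x) {
  unname((x["q3"] - x["q1"]) / (x["max"] - x["min"]))
}

raw_share <- iqr_share(chl)         # on the original scale
log_share <- iqr_share(log10(chl))  # after a log10 transform

print(round(c(raw = raw_share, log10 = log_share), 3))
```

The log transform widens the share taken by the middle half of the wines, but it still stays below a tenth of the axis, which matches the crowded look of the log-scaled plot.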

Sulfur dioxide is measured in two ways: free sulfur dioxide and total sulfur dioxide. They show a similar trend, since total sulfur dioxide includes the free form. I wonder whether total sulfur dioxide has more impact on wine quality than free sulfur dioxide alone. For example, the free sulfur dioxide graph shows a dip at values between 8 and 10, while the total sulfur dioxide graph looks more complete.

summary(rwd$density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

Density is said to depend on the percent alcohol and sugar content. Residual sugar shows a similar trend. I wonder whether alcohol shows a similar trend as well, and whether density would closely follow a combination of the residual sugar and alcohol values.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1085 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

Alcohol shows a somewhat unexpected result compared to density and residual sugar. After drawing the same graph with a larger number of bins, I became interested in how the values are distributed. At some alcohol levels there are very few wines or none at all. The pattern looks like three levels with many wines followed by one with none, and so on. So I zoomed in on the range between 9.1 and 9.7. Alcohol levels seem discrete: there are only a few wines between 9.2 and 9.3, or between 9.5 and 9.6.

summary(rwd$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
table(rwd$alcohol)
## 
##              8.4              8.5              8.7              8.8 
##                2                1                2                2 
##                9             9.05              9.1              9.2 
##               30                1               23               72 
## 9.23333333333333             9.25              9.3              9.4 
##                1                1               59              103 
##              9.5             9.55 9.56666666666667              9.6 
##              139                2                1               59 
##              9.7              9.8              9.9             9.95 
##               54               78               49                1 
##               10 10.0333333333333             10.1             10.2 
##               67                2               47               46 
##             10.3             10.4             10.5            10.55 
##               33               41               67                2 
##             10.6             10.7            10.75             10.8 
##               28               27                1               42 
##             10.9               11 11.0666666666667             11.1 
##               49               59                1               27 
##             11.2             11.3             11.4             11.5 
##               36               32               32               30 
##             11.6             11.7             11.8             11.9 
##               15               23               29               20 
##            11.95               12             12.1             12.2 
##                1               21               13               12 
##             12.3             12.4             12.5             12.6 
##               12               13               21                6 
##             12.7             12.8             12.9               13 
##                9               17                9                6 
##             13.1             13.2             13.3             13.4 
##                2                1                3                3 
##             13.5 13.5666666666667             13.6               14 
##                1                1                4                7 
##             14.9 
##                1

After counting the wines at each alcohol level, I see that very few wines, or none, have an alcohol level with two or more decimal places. I wonder whether the wines at these alcohol levels have anything to do with quality.
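The same check can be written down directly. The sketch below, in base R, takes only the 9.0 to 9.7 slice of the table above and flags the levels that are not a multiple of 0.1. The names alcohol_level, wine_count and off_grid are mine, and the tolerance of 1e-8 is an arbitrary choice to absorb floating-point noise.

```r
# Alcohol levels and counts copied from the 9.0 to 9.7 slice of
# table(rwd$alcohol) above.
alcohol_level <- c(9, 9.05, 9.1, 9.2, 9.23333333333333, 9.25, 9.3, 9.4,
                   9.5, 9.55, 9.56666666666667, 9.6, 9.7)
wine_count    <- c(30, 1, 23, 72, 1, 1, 59, 103, 139, 2, 1, 59, 54)

# A level is "off grid" when it is not a multiple of 0.1
# (the tolerance guards against floating-point noise).
off_grid <- abs(alcohol_level - round(alcohol_level, 1)) > 1e-8

# Which levels are off the 0.1 grid, and how many wines sit on or off it.
print(alcohol_level[off_grid])
print(c(on_grid  = sum(wine_count[!off_grid]),
        off_grid = sum(wine_count[off_grid])))
```

In this slice, only a handful of wines sit at off-grid levels, while the levels in steps of 0.1 hold hundreds.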

## Warning: Removed 1564 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

Although most wines are said to fall between 3 and 4 on the pH scale, I see some wines with a pH below 3.0. Since there are few of them, I wonder whether wines with a pH below 3.0 tend to have good or bad quality.

summary(rwd$sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

It looks like most wines have sulphates between 0.5 and 0.9.

Univariate Analysis

What is the structure of your dataset?

There are 1,599 wines in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality). The “quality” variable can be represented as a factor variable.

(worst) -> (best)

Quality: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Observations:

  • Most wines have a quality of 5 or 6.

  • About 75% of wines have an alcohol content of at least 9.5%.

  • Most wines have chlorides less than 0.1.

What is/are the main feature(s) of interest in your dataset?

The main feature in the data set is the “quality” of wine. I would like to predict which combinations of values of the other variables make a wine better or worse in quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The meanings of all the variables are new to me, so I can only judge from the univariate graphs. The count of wines by quality shows a mountain shape, so I will probably investigate the variables with a similar shape: sulphates, pH, density, chlorides, residual sugar, fixed acidity, and volatile acidity.

Did you create any new variables from existing variables in the dataset?

Not necessarily. The only change I considered was converting the quality variable to a factor, but its values are already strictly discrete. I therefore left it unchanged, although I may convert it in later analysis if needed.
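If the conversion is needed later, it could look like the sketch below in base R. It uses only the first ten quality scores from the str() output as stand-in data, and the name quality_f is hypothetical. Treating the factor as ordered over the full 0 to 10 scale is my assumption, not something this analysis has done.

```r
# Illustrative subset only: the first ten quality scores from the str() output.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Ordered factor over the full 0-10 scale, so unused scores stay visible
# as empty levels.
quality_f <- factor(quality, levels = 0:10, ordered = TRUE)

# Counts per score, including the scores that never occur.
print(table(quality_f))

# Ordered factors keep the worst-to-best ordering, so comparisons still work.
print(quality_f[1] < quality_f[4])
```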

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

For the variable “alcohol”, I zoomed in on the distribution a bit. I found that alcohol does not behave like a continuous variable: values rarely use the second decimal place, and instead jump in steps of 0.1.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                1.000           -0.256       0.672
## volatile.acidity            -0.256            1.000      -0.552
## citric.acid                  0.672           -0.552       1.000
## residual.sugar               0.115            0.002       0.144
## chlorides                    0.094            0.061       0.204
## free.sulfur.dioxide         -0.154           -0.011      -0.061
## total.sulfur.dioxide        -0.113            0.076       0.036
## density                      0.668            0.022       0.365
## pH                          -0.683            0.235      -0.542
## sulphates                    0.183           -0.261       0.313
## alcohol                     -0.062           -0.202       0.110
## quality                      0.124           -0.391       0.226
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                 0.115     0.094              -0.154
## volatile.acidity              0.002     0.061              -0.011
## citric.acid                   0.144     0.204              -0.061
## residual.sugar                1.000     0.056               0.187
## chlorides                     0.056     1.000               0.006
## free.sulfur.dioxide           0.187     0.006               1.000
## total.sulfur.dioxide          0.203     0.047               0.668
## density                       0.355     0.201              -0.022
## pH                           -0.086    -0.265               0.070
## sulphates                     0.006     0.371               0.052
## alcohol                       0.042    -0.221              -0.069
## quality                       0.014    -0.129              -0.051
##                      total.sulfur.dioxide density     pH sulphates alcohol
## fixed.acidity                      -0.113   0.668 -0.683     0.183  -0.062
## volatile.acidity                    0.076   0.022  0.235    -0.261  -0.202
## citric.acid                         0.036   0.365 -0.542     0.313   0.110
## residual.sugar                      0.203   0.355 -0.086     0.006   0.042
## chlorides                           0.047   0.201 -0.265     0.371  -0.221
## free.sulfur.dioxide                 0.668  -0.022  0.070     0.052  -0.069
## total.sulfur.dioxide                1.000   0.071 -0.066     0.043  -0.206
## density                             0.071   1.000 -0.342     0.149  -0.496
## pH                                 -0.066  -0.342  1.000    -0.197   0.206
## sulphates                           0.043   0.149 -0.197     1.000   0.094
## alcohol                            -0.206  -0.496  0.206     0.094   1.000
## quality                            -0.185  -0.175 -0.058     0.251   0.476
##                      quality
## fixed.acidity          0.124
## volatile.acidity      -0.391
## citric.acid            0.226
## residual.sugar         0.014
## chlorides             -0.129
## free.sulfur.dioxide   -0.051
## total.sulfur.dioxide  -0.185
## density               -0.175
## pH                    -0.058
## sulphates              0.251
## alcohol                0.476
## quality                1.000

The variable most correlated with quality is alcohol, with a correlation coefficient of 0.476. Judging by the correlation matrix above, no variable is strongly correlated with quality. Some variables are weakly related, but most appear uncorrelated.

I will go through the variables in decreasing order of their correlation with quality. The first is alcohol, and the second is volatile acidity. I should also go over citric acid and sulphates, because they show relatively strong relationships. I am not sure whether I need to inspect any others. However, I should also inspect the variables related to alcohol, volatile acidity, citric acid, and sulphates.
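The ordering can be reproduced from the matrix above. The sketch below, in base R, copies the quality column by hand into a named vector (cor_with_quality, a name of my own) and sorts it by absolute value, so that negative relationships such as volatile acidity are ranked by strength rather than by sign.

```r
# Correlations with quality, copied from the "quality" column of the
# correlation matrix above.
cor_with_quality <- c(
  fixed.acidity = 0.124, volatile.acidity = -0.391, citric.acid = 0.226,
  residual.sugar = 0.014, chlorides = -0.129, free.sulfur.dioxide = -0.051,
  total.sulfur.dioxide = -0.185, density = -0.175, pH = -0.058,
  sulphates = 0.251, alcohol = 0.476
)

# Rank by strength of the relationship, ignoring its sign.
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
print(ranked)
```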

Here, volatile acidity increases as quality decreases, meaning that higher-quality wines tend to have less volatile acidity. However, I have to keep in mind that a number of lower-quality wines have volatile acidity as low as that of the better-ranked wines. I wonder what makes a wine lower in quality when its volatile acidity is low.

Alcohol shows the opposite trend to volatile acidity: quality increases as the percentage of alcohol increases (a positive relationship). As noted in the univariate section, the alcohol percentage looks somewhat discrete because it usually moves in steps of 0.1. The correlation between quality and alcohol also makes me wonder what lowers the quality of a wine even when it has a high alcohol percentage.

Wines of a certain quality tend to have more than a certain amount of citric acid. As shown above, wines with quality below 5 rarely have more than 0.25 of citric acid. I think this is a clear trend: a wine with more than 0.25 of citric acid looks very likely to score 5 or higher. Still, I wonder what gives some wines better quality even though their citric acid level is low.

Sulphates also shows an increasing relationship with quality. Even though its correlation coefficient is lower than that of alcohol or volatile acidity, the plot shows the relationship more clearly. With outliers removed, I think the picture becomes clearer, as below.

## Warning: Removed 88 rows containing non-finite values (stat_boxplot).
## Warning: Removed 88 rows containing non-finite values (stat_summary).
## Warning: Removed 94 rows containing missing values (geom_point).

## Warning: Removed 88 rows containing non-finite values (stat_density).

##        name Var2  Freq
## 1 sulphates    A 0.409

After removing outliers in sulphates, its correlation coefficient with quality increased from 0.251 to 0.409.
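The general idea of trimming outliers before recomputing a correlation can be sketched as follows in base R. Everything here is an assumption for illustration: the data are synthetic (they are not the wine data), the 95th-percentile cutoff is hypothetical and not necessarily the rule used for the numbers above, and the names wine_sketch, cutoff and trimmed are my own.

```r
# Synthetic data for illustration only; none of these values come from the
# wine data set.
set.seed(1)
n <- 200
quality   <- sample(3:8, n, replace = TRUE)
sulphates <- 0.5 + 0.03 * quality + rnorm(n, sd = 0.05)

# Plant a few extreme sulphate values on low-scoring wines.
sulphates[1:8] <- 1.8
quality[1:8]   <- 4
wine_sketch <- data.frame(quality, sulphates)

# Hypothetical rule: drop rows above the 95th percentile of sulphates.
cutoff  <- quantile(wine_sketch$sulphates, 0.95)
trimmed <- wine_sketch[wine_sketch$sulphates <= cutoff, ]

# Pearson correlation before and after trimming.
r_full    <- cor(wine_sketch$sulphates, wine_sketch$quality)
r_trimmed <- cor(trimmed$sulphates, trimmed$quality)
print(round(c(full = r_full, trimmed = r_trimmed), 3))
```

A few extreme values can mask an otherwise steady relationship, so the coefficient rises once they are set aside. The choice of cutoff changes the result, so it is worth reporting alongside the coefficient.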

##                   name Var2   Freq
## 1        fixed.acidity    A -0.268
## 2       residual.sugar    A -0.116
## 3  free.sulfur.dioxide    A -0.152
## 4 total.sulfur.dioxide    A  0.084
## 5              density    A -0.222
## 6                   pH    A  0.038

Speaking of outliers, it is worth looking at the other variables with low correlation coefficients. However, after removing some of the outliers from the rest of the variables, there is no big change.

In case the remaining variables hold some useful information, I will go through each of them briefly.

As shown in the graph, the median fixed acidity increases with quality. Residual sugar, on the other hand, shows neither a positive nor a negative relationship with quality.

The median free sulfur dioxide increases with quality until quality reaches 5, and then decreases slightly. Better-quality wines seem to have less free sulfur dioxide. There must be something that makes wines worse even when free sulfur dioxide is low.
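The per-quality medians read off the box plots can also be computed directly. The sketch below, in base R, uses only the first ten rows from the str() output as stand-in data, so its numbers say nothing about the full data set. The names free_so2 and median_by_quality are hypothetical.

```r
# Illustrative subset only: the first ten rows from the str() output.
free_so2 <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
quality  <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Median free sulfur dioxide within each quality score.
median_by_quality <- tapply(free_so2, quality, median)
print(median_by_quality)
```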

The overall trend of total sulfur dioxide is very similar to the free sulfur dioxide.

Both density and pH have a negative relationship with quality. Density shows a weaker relationship than pH.

Besides the main variable, quality, some variables are related to each other. It is worth looking at pairs of variables that have a strong relationship.

Fixed acidity increases with citric acid. It also increases as pH decreases.

Sulphates doesn’t appear to combine with any other variable in determining wine quality. It may simply have a strong positive relationship with wine quality on its own.

Higher volatile acidity is found when less citric acid is present. Likewise, higher alcohol goes with lower density, and higher citric acid goes with lower pH.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Volatile acidity, alcohol, citric acid, and sulphates correlate more strongly with quality than the other variables do. However, some of the other variables also show a distinctive pattern across quality levels.

  • Positive relationship: alcohol, citric acid, sulphates, fixed acidity
  • Negative relationship: volatile acidity, density, pH
  • Mountain-like relationship: free/total sulfur dioxide

Even when a variable has a positive relationship with quality, many low-quality wines have values of that variable close to those of the better-quality wines.

This happens for almost every variable, so it is necessary to inspect how quality changes across several variables at once rather than one at a time.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • between fixed acidity and citric acid (positive)
  • between fixed acidity and pH (negative)
  • between volatile acidity and citric acid (negative)
  • between alcohol and density (negative)
  • between citric acid and pH (negative)
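Pairs like these can be pulled out of the correlation matrix mechanically. The sketch below, in base R, copies a six-variable slice of the matrix above by hand and lists each pair whose absolute correlation reaches a threshold. The threshold of 0.49 is a hypothetical choice of mine, and the names r, idx and strong are only for this illustration.

```r
# Six-variable slice of the correlation matrix above, copied by hand.
vars <- c("fixed.acidity", "volatile.acidity", "citric.acid",
          "density", "pH", "alcohol")
r <- matrix(c(
   1.000, -0.256,  0.672,  0.668, -0.683, -0.062,
  -0.256,  1.000, -0.552,  0.022,  0.235, -0.202,
   0.672, -0.552,  1.000,  0.365, -0.542,  0.110,
   0.668,  0.022,  0.365,  1.000, -0.342, -0.496,
  -0.683,  0.235, -0.542, -0.342,  1.000,  0.206,
  -0.062, -0.202,  0.110, -0.496,  0.206,  1.000
), nrow = 6, byrow = TRUE, dimnames = list(vars, vars))

# Hypothetical cutoff for calling a pair "strongly related".
threshold <- 0.49

# Look only above the diagonal so each pair is reported once.
idx <- which(abs(r) >= threshold & upper.tri(r), arr.ind = TRUE)
strong <- data.frame(var1 = vars[idx[, "row"]],
                     var2 = vars[idx[, "col"]],
                     r    = r[idx])

# Strongest pairs first.
print(strong[order(-abs(strong$r)), ], row.names = FALSE)
```

A scan like this also surfaces pairs that are easy to miss by eye, such as fixed acidity and density.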

Multivariate Plots Section

The relationship between alcohol and density shows almost no trend. Alcohol by itself accounts for most of the contribution to quality. Many wines of similar quality have similar density values regardless of their alcohol value.

The relationship between citric acid and fixed acidity shows the quality trend fairly clearly. For instance, wines with higher fixed acidity and higher citric acid tend to be of better quality. However, it is interesting that there are still good wines with low fixed acidity and low citric acid.

The relationship between citric acid and volatile acidity is the opposite of the previous one (citric acid vs. fixed acidity).

We’ve seen that citric acid has a positive relationship with quality, so it is no surprise that there are more better-quality wines at higher values of citric acid. However, in the plot above, pH doesn’t play an important role in determining quality. For example, within the same range of citric acid, the pH value doesn’t separate wines into better and worse groups.

Final Plots and Summary

Plot One

Description One

The distribution of wine quality appears to be approximately normal. Scores of 5 and 6 are the most common. On the other hand, wines scored 3 or 4, or 7 or 8, are rare. It is probably hard to produce good wine, and it is probably hard to produce bad wine as well. Perhaps it is simply easy to achieve average quality.

Plot Two

Description Two

Sulphates and alcohol don’t combine with other variables to determine wine quality. Rather, each on its own has a clearly positive relationship with quality. As described earlier, sulphates is a wine additive, meaning it does not occur naturally. That is probably why it shows no relationship with other variables in assessing wine quality. Additionally, it may be hard to add high levels of sulphates to wines.

Plot Three

Description Three

High fixed acidity, low volatile acidity, and high citric acid seem to be important in determining wine quality.

Reflection

The red wine data set contains information on 1,599 wines across twelve variables from around 2009. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the quality of wines across many variables.

There were two clear kinds of trend. One is between the sulphates and alcohol of a wine and its quality, and the other is between fixed acidity, volatile acidity, citric acid, and quality. I was surprised that so many features of a wine, especially its alcohol level, play an important role in determining its quality. I was also surprised that the sulphates of a wine has a positive relationship with its quality, because I thought it could harm the wine. It is also notable that a wine’s alcohol level has a positive relationship with its quality.

Some limitations of this analysis stem from the source of the data. The number of observations may not be large enough to establish very clear trends. To investigate this data further, I would examine the other dataset provided for white wine quality. I would expect to find some patterns in common between the two, which would support my conclusions more strongly.